Chapter 9: What we have covered so far (and a bit more)


In this chapter, we will work our way through a concise review of the Python functionality we have covered so far. Throughout this chapter, we will work with an interesting, yet not too large dataset, namely the well-known Arabian Nights. Alf Laylah Wa Laylah, the Stories of One Thousand and One Nights, is a collection of folk tales, collected over many centuries by various authors, translators, and scholars across West, Central and South Asia and North Africa. It forms a huge narrative wheel with an overarching plot, created by the frame story of Shahrazad.

The stories begin with the tale of king Shahryar and his brother, who, having both been deceived by their respective Sultanas, leave their kingdom, only to return when they have found someone who, in their view, was wronged even more. On their journey the two brothers encounter a huge jinn who carries a glass box containing a beautiful young woman. The two brothers hide as quickly as they can in a tree. The jinn lays his head on the girl’s lap and as soon as he is asleep, the girl demands that the two kings make love to her or else she will wake her ‘husband’. They reluctantly give in and the brothers soon discover that the girl has already betrayed the jinn ninety-eight times before. This example of lust and treachery strengthens the Sultan’s opinion that all women are wicked and not to be trusted.

When king Shahryar returns home, his wrath against women has grown to an unprecedented level. To temper his anger, each night the king sleeps with a virgin only to execute her the next morning. In order to put an end to this cruelty and save womanhood from a "virgin scarcity", Shahrazad offers herself as the king’s next bride. On the first night, Shahrazad begins to tell the king a story, but she does not end it. The king’s curiosity to know how the story ends prevents him from executing Shahrazad. The next night Shahrazad finishes her story, and begins a new one. The king, eager to know the ending of this tale as well, postpones her execution once more. Using this strategy for One Thousand and One Nights in a labyrinth of stories-within-stories-within-stories, Shahrazad attempts to gradually move the king’s cynical stance against women towards a politics of love and justice (see Marina Warner’s Stranger Magic (2013) in case you're interested).

The first European version of the Nights was translated into French by Antoine Galland. Many translations (in different languages) followed, such as the (heavily criticized) English translation by Sir Richard Francis Burton entitled The Book of the Thousand Nights and a Night (1885). This version is freely available from Project Gutenberg, and it is the one we will explore here.

Files and directories

In the notebooks we use, there is a convenient way to quickly inspect the contents of a folder using the ls command. Our Arabian nights are contained under the general data folder:


In [ ]:
ls data/arabian_nights

As you can see, this folder holds a number of plain text files, ending in the .txt extension. Let us open a random file:


In [ ]:
f = open('data/arabian_nights/848.txt', 'r')
text = f.read()
f.close()
print(text[:500])

Here, we use the open() function to create a file object f, which we can use to access the actual text content of the file. Make sure that you do not pass the 'w' parameter ("write") to open() instead of 'r' ("read"), since this would overwrite and thus erase the existing file. After assigning the string returned by f.read() to the variable text, we print the first 500 characters of text to get an impression of what it contains, using simple string slicing ([:500]). Don't forget to close the file again after you have opened it, or strange things could happen to your file! One little trick which is commonly used to avoid having to explicitly close your file is a with block (mind the indentation):


In [ ]:
with open('data/arabian_nights/848.txt', 'r') as f:
    text = f.read()
print(text[:500])

This code block does exactly the same thing as the previous one but saves you some typing. In this chapter we would like to work with all the files in the arabian_nights directory. This is where loops come in handy of course, since what we really would like to do, is iterate over the contents of the directory. Accessing these contents in Python is easy, but requires importing some extra functionality. In this case, we need to import the os module, which contains all functionality related to the 'operating system' of your machine, such as directory information:


In [ ]:
import os

Using the dot syntax (os.xxx), we can now access all functions that come with this module, such as listdir(), which returns a list of the items contained in a given directory:


In [ ]:
filenames = os.listdir('data/arabian_nights')
print(len(filenames))
print(filenames[:20])

The function os.listdir() returns a list of strings, representing the filenames contained under a directory.

Quiz

  1. In Burton's translation some of the 1001 nights are missing. How many?
  2. Can you come up with a clever way to find out which nights are missing? Hint: a counting loop and some string casting might be useful here!

In [ ]:
# your code goes here

With os.listdir(), you need to make sure that you pass the correct path to an existing directory. Otherwise Python raises a FileNotFoundError, as the next cell shows:


In [ ]:
os.listdir('data/belgian_nights')

It might therefore be convenient to check whether a directory actually exists in a given location:


In [ ]:
print(os.path.isdir('data/arabian_nights'))
print(os.path.isdir('data/belgian_nights'))

The second directory, naturally, does not exist and isdir() evaluates to False in this case. Creating a new (and thus empty) directory is also easy using os:


In [ ]:
os.mkdir('belgian_nights')

We can see that it lives in the present working directory now, by typing ls again:


In [ ]:
ls

Or we use Python:


In [ ]:
print(os.path.isdir('belgian_nights'))

Removing directories is also easy, but PLEASE watch out, sometimes it is too easy: if you remove the wrong directory in Python, it will be gone forever... Unlike other applications, Python does not keep a copy of it in your Trash and it does not have a Ctrl-Z button. Please be careful with what you do, since with great power comes great responsibility! Removing the entire directory which we just created can be done as follows:


In [ ]:
import shutil
shutil.rmtree('belgian_nights')

And lo and behold: the directory has disappeared again:


In [ ]:
print(os.path.isdir('belgian_nights'))

Here, we use the rmtree() function to remove the entire directory in a recursive way: even if the directory isn't empty and contains files and subfolders, we will remove all of them. The os module also comes with an rmdir() function, but this will not allow you to remove a directory which is not empty, as becomes clear in the OSError raised below:


In [ ]:
os.rmdir('data/arabian_nights')

The folder contains things and therefore cannot be removed using this function. There are, of course, also ways to remove individual files or check whether they exist:


In [ ]:
os.mkdir('belgian_nights')
f = open('belgian_nights/1001.txt', 'w')
f.write('Content')
f.close()
print(os.path.exists('belgian_nights/1001.txt'))
os.remove('belgian_nights/1001.txt')
print(os.path.exists('belgian_nights/1001.txt'))

Here, we created a directory, wrote a new file to it (1001.txt), and removed it again. Using os.path.exists() we monitored at which point the file existed. Finally, the shutil module also ships with a useful copyfile() function which allows you to copy files from one location to another, possibly with another name. To copy night 66 to the present directory, for instance, we could do:


In [ ]:
shutil.copyfile('data/arabian_nights/66.txt', 'new_66.txt')

Indeed, we have added an exact copy of night 66 to our present working directory:


In [ ]:
ls

We can safely remove it again:


In [ ]:
os.remove('new_66.txt')
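
The operations from this section can be rehearsed safely in a throwaway directory. The cell below is a minimal sketch, not part of the original notebook: it assumes the standard tempfile module to create a sandbox, and the folder and file names in it are made up for illustration. It also catches the OSError that os.rmdir() raises on a non-empty directory, so that the cell keeps running.

```python
import os
import shutil
import tempfile

# Work inside a throwaway directory so nothing under data/ can be harmed.
sandbox = tempfile.mkdtemp()
nights_dir = os.path.join(sandbox, 'belgian_nights')  # hypothetical folder name
os.mkdir(nights_dir)

# Write a small file and copy it under a new name.
original = os.path.join(nights_dir, '1001.txt')
with open(original, 'w') as f:
    f.write('Content')
duplicate = os.path.join(nights_dir, 'new_1001.txt')
shutil.copyfile(original, duplicate)
print(sorted(os.listdir(nights_dir)))

# os.rmdir() refuses to delete a directory that still holds files.
rmdir_failed = False
try:
    os.rmdir(nights_dir)
except OSError as err:
    rmdir_failed = True
    print('rmdir failed:', type(err).__name__)

# shutil.rmtree() removes the directory and everything in it.
shutil.rmtree(nights_dir)
print(os.path.isdir(nights_dir))

# Clean up the sandbox itself.
shutil.rmtree(sandbox)
```

Because everything happens inside the sandbox, a typo in a path cannot touch your real data, which makes this a reasonable habit whenever you experiment with rmtree().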

Paths

The paths we have used so far are 'relative' paths, in the sense that they are relative to the place on our machine from which we execute our Python code. Absolute paths can also be retrieved and will differ on each computer, because they typically include user names etc:


In [ ]:
os.path.abspath('data/arabian_nights/848.txt')

While absolute paths are longer to type, they have the advantage that they can be used anywhere on your computer (i.e. irrespective of where you run your code from). Paths can be tricky. Suppose that we would like to open one of our filenames:


In [ ]:
filenames = os.listdir('data/arabian_nights')
random_filename = filenames[9]
with open(random_filename, 'r') as f:
    text = f.read()
print(text[:500])

Python throws a FileNotFoundError, complaining that the file we wish to open does not exist. This situation stems from the fact that os.listdir() only returns the base name of a given file, and not an entire (absolute or relative) path to it. To properly access the file, we must therefore not forget to include the rest of the path again:


In [ ]:
filenames = os.listdir('data/arabian_nights')
random_filename = filenames[9]
with open('data/arabian_nights/'+ random_filename, 'r') as f:
    text = f.read()
print(text[:500])

Apart from os.listdir() there are a number of other common ways to obtain directory listings in Python. Using the glob module for instance, we can easily access the full relative path leading to our Arabian Nights:


In [ ]:
import glob
filenames = glob.glob('data/arabian_nights/*')
print(filenames[:10])

The asterisk (*) in the argument passed to glob.glob() is worth noting here. It is a wildcard which will match any series of characters (i.e. the filenames under arabian_nights). Note that this differs slightly from regular expressions, where the asterisk is a quantifier and the equivalent pattern would be .* instead. When we exploit this wildcard syntax, glob.glob() offers another distinct advantage: we can use it to easily filter out filenames which we are not interested in:


In [ ]:
filenames = glob.glob('data/arabian_nights/*.txt')
print(filenames[:10])

The command in this code block will only return filenames that end in ".txt". This is useful when we would like to ignore junk files and the like that might be present in a directory. To replicate similar behaviour with os.listdir(), we would have needed a typical for-loop, such as:


In [ ]:
filenames = []
for fn in os.listdir('data/arabian_nights'):
    if fn.endswith('.txt'):
        filenames.append(fn)
print(filenames[:10])

Or for you stylish coders out there, you can show off with a list comprehension:


In [ ]:
filenames = [fn for fn in os.listdir('data/arabian_nights') if fn.endswith('.txt')]
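
To see how the two approaches compare, here is a small sketch on a temporary folder with made-up file names (the folder is created with the standard tempfile module, which is an assumption of this example and not something the chapter relies on). Both approaches select the same files, but only glob.glob() keeps the directory in front of each name.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Two text files and one file that should be filtered out.
    for name in ['1.txt', '2.txt', 'notes.md']:
        with open(os.path.join(folder, name), 'w') as f:
            f.write('placeholder')

    from_glob = glob.glob(os.path.join(folder, '*.txt'))
    from_listdir = [fn for fn in os.listdir(folder) if fn.endswith('.txt')]

    # Same number of hits...
    print(len(from_glob) == len(from_listdir))
    # ...but glob returns paths that still include the folder,
    print(all(path.startswith(folder) for path in from_glob))
    # whereas listdir only returns bare file names.
    print(any(name.startswith(folder) for name in from_listdir))
```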

However, when using glob.glob(), you might sometimes want to be able to extract a file's base name again. There are several solutions to this:


In [ ]:
filenames = glob.glob('data/arabian_nights/*.txt')
fn = filenames[10]

# simple string splitting:
print(fn.split('/')[-1])

# using os.sep:
print(fn.split(os.sep)[-1])

# using os.path:
print(os.path.basename(fn))

Both os.sep and os.path.basename have the advantage that they know what separator is used for paths in the operating system, so you don't need to explicitly code it like in the first solution. Separators differ between Windows (backslash) and Mac/Linux (forward slash).
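
The same idea works in the other direction. As a small sketch (os.path.join() and os.path.splitext() are standard library functions, but they have not been introduced in this chapter so far), you can let Python build a path for you instead of typing the separator yourself:

```python
import os

# Build a path piece by piece; os.path.join() inserts the right separator.
path = os.path.join('data', 'arabian_nights', '848.txt')
print(path)

# Splitting on os.sep and os.path.basename() both recover the file name.
print(path.split(os.sep)[-1])
print(os.path.basename(path))

# os.path.splitext() separates the name from its extension.
name, extension = os.path.splitext(os.path.basename(path))
print(name, extension)
```

The first printed line will differ between Windows and Mac/Linux; the remaining ones should be identical everywhere.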

Finally, sometimes, you might be interested in all the subdirectories of a particular directory (and all the subdirectories of these subdirectories etc.). Parsing such deep directory structures can be tricky, especially if you do not know how deep a directory tree might run. You could of course try stacking multiple loops using os.listdir(), but a more convenient way is os.walk():


In [ ]:
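# for every directory in the tree, os.walk() yields its path,
# a list of its subdirectories and a list of its files
for root, dirnames, filenames in os.walk("data"):
    print(filenames)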

As you can see, os.walk() allows you to efficiently loop over the entire tree. As always, don't forget that help is right around the corner in your notebooks. Using help(), you can quickly access the documentation of modules and their functions etc. (but only after you have imported the modules first!).


In [ ]:
help(os.walk)
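
To make the three values yielded by os.walk() more concrete, here is a minimal sketch on a small temporary directory tree. The layout and the file names are made up for illustration, and the tree is created with the standard tempfile module. Joining root with each file name gives a usable path to every file, however deep it sits.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Hypothetical layout: top/a/b/ with one file at each level.
    os.makedirs(os.path.join(top, 'a', 'b'))
    for parts in [('root.txt',), ('a', 'one.txt'), ('a', 'b', 'two.txt')]:
        with open(os.path.join(top, *parts), 'w') as f:
            f.write('placeholder')

    all_files = []
    for root, dirnames, filenames in os.walk(top):
        # root is the directory currently visited,
        # filenames holds the base names of the files inside it.
        for fn in filenames:
            all_files.append(os.path.join(root, fn))

    # Show the paths relative to the top of the tree.
    relative = sorted(os.path.relpath(p, top) for p in all_files)
    print(relative)
```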

Quiz

In the next part of this chapter, we will need a way to sort our stories from the first to the very last night. For our own convenience we will use a little hack for this. In this quiz, we would like you to create a new folder under the data directory, called '1001'. You should copy all the original files from arabian_nights to this new folder, but give the files a new name, prepending zeros to each filename until all nights have four digits in their name. 1001.txt stays 1001.txt, for instance, but 66.txt becomes 0066.txt and 2.txt becomes 0002.txt etc. This will make sorting the nights easier below. For this quiz you could for instance use a for loop in combination with a while loop (but don't get stuck in endless loops...).


In [ ]:
# your quiz code

Parsing files

Using the code from the previous quiz, it is now trivial to sort our nights sequentially on the basis of their actual name (i.e. a string variable):


In [ ]:
for fn in sorted(os.listdir('data/1001')):
    print(fn)

Using the old filenames, this was not possible directly, because of the way Python sorts strings of unequal lengths. Note that the numbers in the filenames are represented as strings, which are completely different from real numeric integers, and thus will be sorted differently:


In [ ]:
for fn in sorted(os.listdir('data/arabian_nights/')):
    print(fn)

Note: There is a more elegant, but also slightly less trivial way to achieve the correct order in this case:


In [ ]:
for fn in sorted(os.listdir('data/arabian_nights/'),
                key=lambda nb: int(nb[:-4])):
    print(fn)

Should you be interested: here, we pass a key argument to sorted(), which specifies which operation should be applied to each filename before actually sorting them. Here, we pass a so-called lambda function to key, which is less intuitive to read, but which allows you to specify a sort of 'mini-function' in a very condensed way: this lambda function chops off the last four characters from each filename (the '.txt' extension) and then converts (or 'casts') the result to a new data type using int(), namely an integer (a 'whole' number, as opposed to floating point numbers). Eventually, this leads to the same order.
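
If the lambda looks cryptic, it may help to see it next to an ordinary named function that does the same thing. The sketch below uses a handful of made-up filenames rather than the real directory listing, and night_number is a hypothetical helper name chosen for this example.

```python
names = ['10.txt', '2.txt', '1001.txt', '1.txt']  # made-up examples

def night_number(filename):
    # Drop the '.txt' suffix (four characters) and cast the rest to int.
    return int(filename[:-4])

as_strings = sorted(names)                      # plain string sorting
with_function = sorted(names, key=night_number)  # sort by the number
with_lambda = sorted(names, key=lambda nb: int(nb[:-4]))

print(as_strings)
print(with_function)
print(with_lambda == with_function)
```

The first line printed shows the string order, in which '10.txt' comes before '2.txt'; the second shows the numeric order we are after.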

More functions

So far, we have been using pre-existing, ready-made functions from Python's standard library, i.e. the standard set of functionality which comes with the programming language. Importantly, there are two additional ways of using functions in your code, which we will cover below: (i) you can write your own functions, and (ii) you can use functions from other, external libraries, which have been developed by so-called 'third parties'. Below, we will for instance use plotting functions from matplotlib, which is a common visualization library for Python.

At this point, we have an efficient way of looping over the Arabian Nights sequentially. What we still lack, are functions to load and clean our data. As you could see above, our files still contain a lot of punctuation marks etc., which are perhaps less interesting from the point of view of textual analysis. Let us write a simple function that takes a string as input, and returns a cleaner version of it, where all characters are lowercased, and only alphabetic characters are kept:


In [ ]:
import re
def preprocess(in_str):
    out_str = ''
    for c in in_str.lower():
        if c.isalpha() or c.isspace():
            out_str += c
    whitespace = re.compile(r'\s+')
    out_str = whitespace.sub(' ', out_str)
    return out_str

This code reviews some of the materials from previous chapters, including the use of a regular expression, which converts all consecutive instances of whitespace (including line breaks, for instance) to a single space. After executing the previous code block, we can now test our function:


In [ ]:
old_str = 'This;     is -- a very    DIRTY string!'
new_str = preprocess(old_str)
print(new_str)
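
The whitespace step inside preprocess() can also be looked at in isolation. The following sketch applies the same regular expression to a made-up string that contains line breaks and a tab:

```python
import re

# \s matches any whitespace character; + means 'one or more in a row'.
whitespace = re.compile(r'\s+')

messy = 'one\n\ntwo \t three'  # made-up string
collapsed = whitespace.sub(' ', messy)
print(collapsed)
```

Each run of whitespace, whatever it consists of, is replaced by exactly one space.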

We can now apply this function to the contents from a random night:


In [ ]:
with open('data/1001/0007.txt', 'r') as f:
    in_str = f.read()
print(preprocess(in_str))

This text looks cleaner already! We can now start to extract individual tokens from the text and count them. This process is called tokenization. Here, we make the naive assumption that words are simply space-free alphabetic strings, which is of course wrong in the case of English words like "can't". Note that for many languages there exist better tokenizers in Python, such as the ones in the Natural Language Toolkit (nltk). We will make do with a simpler approach for now:


In [ ]:
def tokenize(in_str):
    tokens = in_str.split()
    tokens = [t for t in tokens if t]
    return tokens

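Using the list comprehension, we make sure that we never return empty strings as tokens. Calling split() without an argument already splits on any run of whitespace and does not return empty strings, so the comprehension is strictly speaking redundant here. It serves as a safeguard: if you were to split on a single space instead (in_str.split(' ')), empty strings would appear wherever two spaces follow each other, or at the beginning of a text which starts with a space. Remember that anything in Python with a length of 0 will evaluate to False, which explains the if t in the comprehension: empty strings will fail this condition. We can start stacking our functions now: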


In [ ]:
with open('data/1001/0007.txt', 'r') as f:
    in_str = f.read()
tokens = tokenize(preprocess(in_str))
print(tokens[:10])
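
As a brief aside on tokenize(): the difference between split() and split(' ') can be checked on a made-up string with a leading space, a double space and a trailing space. This is only an illustrative sketch and it does not use the Arabian Nights files.

```python
text = ' the  king slept '  # made-up string

# Without an argument, split() treats any run of whitespace as one separator.
default_split = text.split()
print(default_split)

# With an explicit single space, empty strings show up.
space_split = text.split(' ')
print(space_split)

# The 'if t' filter removes them again, because '' evaluates to False.
filtered = [t for t in space_split if t]
print(filtered)
```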

We can now start analyzing our nights. A good start would be to check the length of each night in words:


In [ ]:
print(len(tokens))

Quiz

Iterate over all the nights in 1001 in a sorted way. Open, preprocess and tokenize each text. Store in a list called word_counts how many words each story has.


In [ ]:
# your quiz code

We now have a list of numbers, which we can plot over time. We will cover plotting more extensively in one of the next chapters. The things below are just a teaser. Start by importing matplotlib, which is imported as follows by convention:


In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline

The second line is needed to make sure that the plots will properly show up in our notebook. Let us start with a simple visualization:


In [ ]:
plt.plot(word_counts)

As you can see, this simple command can be used to quickly obtain a visualization that shows interesting trends. On the y-axis, we plot absolute word counts for each of our nights. The values on the horizontal x-axis are figured out automatically by matplotlib, which simply uses the index of each item in the list. Implicitly, it interprets our command as follows:


In [ ]:
plt.plot(range(0, len(word_counts)), word_counts)

When plt.plot receives two flat lists as arguments, it plots the first along the x-axis, and the second along the y-axis. If it only receives one list, it plots it along the y-axis and uses the range we now (redundantly) specified here for the x-axis. This is in fact a suboptimal plot, since the index of the first data point we plot is zero, although the name of the first night is '1.txt'. Additionally, we know that there are some nights missing in our data. To set this straight, we could pass in our own x-coordinates as follows:


In [ ]:
filenames = sorted(os.listdir('data/1001'))
idxs = [int(i[:-4]) for i in filenames]
print(idxs[:20])
print(min(idxs))
print(max(idxs))

We can now make our plot more truthful, and add some bells and whistles:


In [ ]:
plt.plot(idxs, word_counts, color='r')
plt.xlabel('Night')
plt.ylabel('# words (absolute counts)')
plt.title('The Arabian Nights')
plt.xlim(1, 1001)

Quiz

Using axvline() you can add vertical lines to a plot, for instance at position 500:


In [ ]:
plt.plot(idxs, word_counts, color='r')
plt.xlabel('Night')
plt.ylabel('# words (absolute counts)')
plt.title(r'The Arabian Nights')
plt.xlim(1, 1001)
plt.axvline(500, color='g')

Write code that plots the position of the missing nights using this function (and blue lines).


In [ ]:
# quiz code goes here

Right now, we are visualizing texts, but we might also be interested in the vocabulary used in the story collection. Counting how often a word appears in a text is trivial for you right now with custom code, for instance:


In [ ]:
cnts = {}
for word in tokens:
    if word in cnts:
        cnts[word] += 1
    else:
        cnts[word] = 1
print(cnts)

One interesting item which you can use for counting in Python is the Counter object, which we can import as follows:


In [ ]:
from collections import Counter

This Counter makes it much easier to write code for counting. Below you can see how this counter automatically creates a dictionary-like structure:


In [ ]:
cnt = Counter(tokens)
print(cnt)

If we would like to find which items are most frequent for instance, we could simply do:


In [ ]:
print(cnt.most_common(25))

We can also pass the Counter the tokens to count in multiple stages:


In [ ]:
cnt = Counter()
cnt.update(tokens)
cnt.update(tokens)
print(cnt.most_common(25))

After passing our tokens twice to the counter, we see that the numbers double in size.
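
The same behaviour is easier to verify by hand on a tiny example. The sketch below feeds two made-up token lists to a Counter in two stages, and also shows that asking for a word which was never counted gives zero rather than an error.

```python
from collections import Counter

tokens_a = ['the', 'king', 'the', 'jinn']  # made-up token lists
tokens_b = ['the', 'tale']

cnt = Counter()
cnt.update(tokens_a)  # first batch
cnt.update(tokens_b)  # second batch is added to the existing counts

print(cnt['the'])
print(cnt['sultan'])  # a word we never saw
print(cnt.most_common(1))
```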

Quiz

Write code that makes a word frequency counter named vocab, which counts the cumulative frequencies of all words in the Arabian Nights. Which are the 15 most frequent words? Does that make sense?


In [ ]:
# quiz code

Let us now finally visualize the frequencies of the 15 most frequent items using a standard barplot in matplotlib. This can be achieved as follows. We first split out the names and frequencies, since .most_common(n) returns a list of tuples, and we create indices:


In [ ]:
freqs = [f for _, f in vocab.most_common(15)]
words = [w for w, _ in vocab.most_common(15)] # note the use of underscores for 'throwaway' variables
idxs = range(1, len(freqs)+1)

Next, we simply do:


In [ ]:
plt.barh(idxs, freqs, align='center')
plt.yticks(idxs, words)
# barh() draws horizontal bars: frequencies run along x, words along y
plt.xlabel('Cumulative absolute frequencies')
plt.ylabel('Words')

Et voilà!

Closing Assignment

In this larger assignment, you will have to perform some basic text processing on the larger set of XML-encoded files under data/TEI/french_plays. For this assignment, there are several subtasks:

  1. Each of these files represents a play written by a particular author (see the <author> element): count how many texts were written by each author in the entire corpus. Make use of a Counter.
  2. Each play has a cast list (<castList>), with a role-element for every character in it. In this element, the civil-attribute encodes the gender of the character (M/F, or another character). Create for each individual author a barplot using matplotlib, showing the percentage of male, female and 'other' characters. Pick beautiful colors.
  3. Difficult: The information contained in the castList is priceless, because it allows us to determine for each word in the play by whom it is uttered, since the <sp> tag encodes which character in the cast list is speaking at a particular time. Parse play 156.xml (L'Amour à la mode) and calculate which of the characters has the highest vocabulary richness: divide the number of unique words in the speaker's utterances by the total number of words (s)he utters. Only consider speakers that utter at least 1000 tokens in the play.

Hint: If you run into encoding errors etc. when processing larger text collections, you can always use try/except constructions to catch these.
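
As a rough sketch of what such a try/except construction could look like: read_text below is a hypothetical helper written for this example, and the two files it is tried on are made up and created in a temporary folder, so none of this reflects the actual contents of data/TEI/french_plays. The helper returns None when a file cannot be decoded, so that a loop over many files can skip the problematic ones.

```python
import os
import tempfile

def read_text(path, encoding='utf-8'):
    """Hypothetical helper: return the file's text, or None if decoding fails."""
    try:
        with open(path, 'r', encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as err:
        print('Skipping', path, '-', err.reason)
        return None

with tempfile.TemporaryDirectory() as folder:
    good = os.path.join(folder, 'good.xml')
    bad = os.path.join(folder, 'bad.xml')
    with open(good, 'w', encoding='utf-8') as f:
        f.write('<title>café</title>')
    with open(bad, 'wb') as f:
        f.write(b'<title>caf\xe9</title>')  # Latin-1 bytes, not valid UTF-8

    good_text = read_text(good)
    bad_text = read_text(bad)

print(good_text)
print(bad_text)
```

Whether skipping a file is acceptable depends on your analysis; an alternative is to retry with a different encoding inside the except block.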

